Paper #782: Identifying and Generating Easy Sets of Constraints For Clustering
Abstract
Clustering under constraints is a recent innovation in the artificial intelligence community that has yielded significant practical benefit. However, recent work has shown that for some negative forms of constraints the associated subproblem of just finding a feasible clustering is NP-complete. These worst-case results for the entire problem class say nothing of where and how prevalent easy problem instances are. In this work, we show that there are large pockets within these problem classes where clustering under constraints is easy and that using easy sets of constraints yields better empirical results. We then illustrate several sufficient conditions from graph theory to identify a priori where these easy problem instances are, and present algorithms to create large and easy-to-satisfy constraint sets.

Introduction and Motivation

Clustering is a ubiquitous unsupervised learning activity used within artificial intelligence for object identification in images, information retrieval and natural language understanding (Wagstaff et al. 2001). A recent innovation has been the introduction of clustering under instance-level constraints, which effectively provide hints about the composition of a desirable clustering of the instances. The most prevalent forms of constraints are must-link (ML), where two instances must be in the same cluster, and cannot-link (CL), where they must be in different clusters. These two types of constraints offer the ability to incorporate strong background knowledge into the clustering process. For example, when clustering automobile GPS trace information to form clusters that are traffic lanes (Wagstaff et al. 2001), the physical distance between lanes (4 meters) can be used to generate cannot-link constraints between instances. However, in practice constraints are typically randomly generated from labeled data: if two randomly chosen instances have the same (different) label, an ML (CL) constraint is generated between them.
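The random generation scheme just described can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the seeding choice are my own.

```python
import random

def generate_constraints(labels, num_pairs, seed=0):
    """Randomly generate constraints from labeled data: sample pairs of
    distinct instances; same label -> must-link (ML), different -> cannot-link (CL)."""
    rng = random.Random(seed)
    ml, cl = [], []
    for _ in range(num_pairs):
        i, j = rng.sample(range(len(labels)), 2)  # two distinct instances
        (ml if labels[i] == labels[j] else cl).append((i, j))
    return ml, cl
```

Note that sampling with replacement over pairs can produce duplicate constraints; implementations that need distinct constraints would deduplicate the sampled pairs.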
The main uses of ML and CL constraints have been in the context of three classes of algorithms: a) algorithms that satisfy every constraint, such as COP-k-means (Wagstaff et al. 2001); b) algorithms that satisfy as many constraints as possible, such as PKM (Bilenko et al. 2004); and c) algorithms that learn a distance function (Xing et al. 2003) so that points which are part of an ML (CL) constraint are close (far) according to the distance function. Some algorithms have multiple objectives, such as MPKM (Bilenko et al. 2004), which attempts both b) and c).

Department of Computer Science, University at Albany, State University of New York, Albany, NY 12222. Email: [email protected].

However, when clustering under constraints the feasibility sub-problem arises: does there exist any solution that satisfies all constraints? For example, there is no possible clustering under the constraints ML(a, b), CL(a, b); and even for a self-consistent constraint set such as CL(a, b), CL(b, c) and CL(a, c), there is no clustering for k ≤ 2. Formally:

Definition 1. The Feasibility Problem. Given a set D of data points, a collection C of ML and CL constraints on some points in D, and upper (Ku) and lower (Kl) bounds on the number of clusters, does there exist at least one partition of D into k clusters such that Kl ≤ k ≤ Ku and all constraints in C are satisfied?

If this question can be answered efficiently, then one can generate a feasible clustering at each iteration of a clustering-under-constraints algorithm. Previous work (Davidson & Ravi, 2005a) has produced worst-case results for the feasibility problem for clustering under ML and CL constraints, amongst others. The feasibility problem for clustering under ML constraints alone is in P, while for clustering under CL constraints alone, and under ML and CL constraints together, it is NP-complete, as can be shown by a reduction from graph coloring.
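The hardness in Definition 1 comes from the bounds on the number of clusters: when k is unconstrained, checking whether a set of ML and CL constraints is mutually consistent reduces to a transitive-closure test over the ML pairs. The following is a minimal sketch of that polynomial-time check (my own illustration, not code from the paper):

```python
def find(parent, x):
    """Find the representative of x's ML component (path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def ml_cl_consistent(n, ml, cl):
    """Check whether ML and CL constraints over instances 0..n-1 admit
    any clustering when the number of clusters is unbounded: merge ML
    pairs with union-find, then reject if a CL pair fell in one ML block."""
    parent = list(range(n))
    for a, b in ml:
        parent[find(parent, a)] = find(parent, b)
    return all(find(parent, a) != find(parent, b) for a, b in cl)
```

On the examples above, ML(a, b) together with CL(a, b) is rejected, while CL(a, b), CL(b, c), CL(a, c) passes, since it is satisfiable once k = 3 is allowed; it is exactly the bounds Kl ≤ k ≤ Ku that make the general problem NP-complete.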
It is tempting, then, to abandon the use of CL constraints. However, the transitivity property of ML constraints and the entailment property of CL constraints (Wagstaff et al. 2001) can result in many entailed CL constraints, and hence CL constraints are quite useful. For example, the constraints ML(a, b), ML(a, c), ML(d, e), CL(a, e) entail the additional constraints ML(b, c), CL(a, d), CL(b, d), CL(b, e), CL(c, d) and CL(c, e). The worst-case complexity results just make a general statement about the feasibility problem, namely that it contains a core of difficult problem instances. Of more practical importance to users and A.I. researchers are more pragmatic questions, such as: how does one identify an easy problem instance a priori, and how does one generate such instances?

How we define an easy problem instance is important. An idealized definition would of course be a set of constraints for which there is at least one feasible solution. However, since the feasibility problem is NP-complete for CL constraints, testing any necessary and sufficient condition to identify feasible constraint sets a priori cannot be carried out efficiently so long as P ≠ NP. Furthermore, knowing that a constraint set has a feasible solution does not make it easy to find that solution. As we shall see later, the ordering of the instances in D plays an important role for most iterative-style clustering algorithms, such as k-means and EM, which assign instances to clusters in a fixed pre-determined order. Therefore, we will adopt the definition that a problem instance is easy for a clustering algorithm if a feasible solution can be found given an ordering of the instances in the data set, which in turn determines the order in which the instances will be assigned to clusters. Formally:

Figure 1: A graphical representation of the feasibility problem for ML(a,b), ML(a,c), ML(d,e), ML(f,g), ML(h,i), ML(j,k), CL(a,e), CL(i,j), CL(d,k), CL(e,l)
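The entailment in the example above can be computed mechanically: take the transitive closure of the ML relation to form ML blocks, then propagate each CL constraint across every pair of points in the two blocks it touches. A sketch under my own naming (not the paper's code; the returned sets include the given constraints as well as the entailed ones):

```python
from itertools import combinations

def entailed_constraints(points, ml, cl):
    """Close ML under transitivity and propagate CL across ML blocks.
    Returns (all_ml, all_cl) as sets of frozenset pairs."""
    comp = {p: p for p in points}
    def find(x):
        while comp[x] != x:
            x = comp[x]
        return x
    for a, b in ml:                      # union the ML components
        comp[find(a)] = find(b)
    blocks = {}
    for p in points:                     # group points by ML block
        blocks.setdefault(find(p), []).append(p)
    all_ml = {frozenset(pair) for block in blocks.values()
              for pair in combinations(block, 2)}
    all_cl = set()
    for a, b in cl:                      # every cross-block pair is CL
        for x in blocks[find(a)]:
            for y in blocks[find(b)]:
                all_cl.add(frozenset((x, y)))
    return all_ml, all_cl
```

On the running example (ML(a, b), ML(a, c), ML(d, e), CL(a, e)), this yields the block {a, b, c} linked to {d, e}, recovering ML(b, c) and the five additional CL constraints listed above.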
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
Publication date: 2006